4. Main Analysis
4.1 Which accommodation type is most popular?
Rating score is an effective way to gauge popularity, so we first examine the quality of the rating scores in this data set. We assume that the review score is critical: it is an indicator of a good host, which Airbnb customers value.
ggplot(airbnb, aes(x = review_scores_rating)) +
geom_histogram(binwidth = 1, color = 'black', fill = "lightblue") +
ggtitle("Histogram of Review Scores Rating") +
xlab("Review Scores Rating") +
ylab("Count") +
theme(plot.title = element_text(hjust = 0.5))

There are 1148 observations with missing values. Most listings’ scores are over 80, and the mode is 100. The data also show rounding patterns: at or below 60, only the values 20, 40, 50 and 60 appear.
When people rate, they tend to give the same score on all aspects, which produces the rounding patterns. People are also more likely to rate their stay when they are satisfied than when they are not, which explains the many full scores.
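The three observations above (missing values, the mode, and the rounding pattern among low scores) can each be checked with a few lines of base R. This is a minimal sketch on a made-up vector; `scores` is a hypothetical stand-in for `airbnb$review_scores_rating`, and its values are not taken from the real data.

```r
# Hypothetical stand-in for airbnb$review_scores_rating (not the real data)
scores <- c(100, 100, 98, 95, 80, 60, 40, 20, NA, 100, NA, 93)

# Number of listings without a rating
n_missing <- sum(is.na(scores))

# Most frequent score (the mode); table() drops NA by default
score_counts <- table(scores)
score_mode <- as.numeric(names(score_counts)[which.max(score_counts)])

# Distinct values at or below 60, to look for the rounding pattern
low_values <- sort(unique(scores[!is.na(scores) & scores <= 60]))

print(n_missing)
print(score_mode)
print(low_values)
```

On the real column, a short list of distinct low values would support the rounding pattern described above.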
ggplot(airbnb, aes(x = number_of_reviews, y = review_scores_rating)) +
geom_point(stroke = 0, alpha = 0.3, color = 'blue') +
ggtitle("Number of Reviews v.s. Review Scores Rating") +
xlab("Number of Reviews") +
ylab("Review Scores Rating") +
theme(plot.title = element_text(hjust = 0.5))

We can see from the scatterplot that few listings have many reviews (over 100) and a low review score (less than 80). When people choose where to stay, they prefer listings with higher rating scores, so high-score listings generally get more reviews. When the number of reviews surpasses a certain threshold (approximately 300), the scores tend to be higher, which indicates consistently good hosts.
Since the quality of the reviews in this data set meets our expectations, we can use them to analyze what people think of the different types of accommodation.
ggplot(airbnb, aes(x=review_scores_rating,group = room_type, color = room_type)) +
geom_density(alpha = .3) +
ggtitle("Density Curve of Review Scores Rating Group by Room Type") +
xlab("Review Scores Rating") +
ylab("Density") +
scale_color_discrete(name = "Room Type") +
theme(plot.title = element_text(hjust = 0.5))

The graph above shows the review score distribution for the three types of accommodation (entire home, private room and shared room); we want to find out which type travelers favor. The rating score density becomes sharper as people get more private space, which suggests that guests are more satisfied when they interact less with others. Compared with interacting with strangers, whether another traveler (shared room) or the household (private room), it is better to have quiet family time or time alone after a day of travel, which having the entire home allows. Giving customers more personal space appears to be a strong positive factor in their satisfaction.
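The "sharper density" reading can be backed by a per-group summary: a higher median with a smaller spread corresponds to a sharper peak near the top of the scale. The sketch below uses base R's `tapply` on a made-up data frame; `toy` and its values are hypothetical and only mimic the `room_type` and `review_scores_rating` columns.

```r
# Hypothetical stand-in for airbnb[, c("room_type", "review_scores_rating")]
toy <- data.frame(
  room_type = rep(c("Entire home/apt", "Private room", "Shared room"), each = 3),
  review_scores_rating = c(98, 96, 100, 95, 90, 100, 80, 92, 100),
  stringsAsFactors = FALSE
)

# Median rating per room type
rating_median <- tapply(toy$review_scores_rating, toy$room_type, median)

# Interquartile range per room type (smaller means a sharper distribution)
rating_iqr <- tapply(toy$review_scores_rating, toy$room_type, IQR)

print(rating_median)
print(rating_iqr)
```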
4.2 What’s the most effective driver for the listing price?
4.2.1 Review Score
The first thing that comes to mind is the review score: does a nice host have the power to charge more?
ggplot(airbnb, aes(x = price, y = review_scores_rating)) +
geom_point(stroke = 0, alpha = 0.3, color = 'blue') + xlim(0,2500) +
ggtitle("Price v.s. Review Scores Rating") +
xlab("Price") +
ylab("Review Scores Rating") +
theme(plot.title = element_text(hjust = 0.5))

Very few points lie in the lower right part of the graph, which indicates that people who stay in higher-priced rooms are much less likely to be dissatisfied, since they are more likely to have a good experience with better room quality or service. However, we cannot conclude that high review scores drive the price up, since rooms with lower prices also receive high review scores.
Next, we assume that customers would prefer a larger space so they can stay more comfortably.
ggplot(SFPrice, aes(x = square_feet,y = price)) +
geom_point(alpha = 0.3, color = "blue",stroke = 0) +
geom_density_2d(color = "maroon") +
ylim(0,700) +
xlab("Square Feet") +
ylab("Price") +
ggtitle("Price v.s Square Feet") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))

The figure above plots price against square feet for listings in San Francisco. The graph suggests a positive correlation between price and room size, but the relationship is weak, so we cannot claim that room size is a significant driver of price.
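A correlation test is one way to put a number on "positive but not significant". The sketch below uses base R's `cor.test` on a small made-up sample; the two vectors are hypothetical stand-ins for the `square_feet` and `price` columns of `SFPrice`, so the printed values say nothing about the real data.

```r
# Hypothetical sample standing in for SFPrice$square_feet and SFPrice$price
square_feet <- c(300, 450, 500, 700, 800, 1000, 1200, 1500)
price <- c(120, 90, 200, 150, 130, 260, 180, 240)

# Pearson correlation with a test of the null hypothesis "correlation is zero"
fit <- cor.test(square_feet, price, method = "pearson")

# estimate is the correlation coefficient; p.value measures how compatible
# the sample is with no linear relationship at all
print(unname(fit$estimate))
print(fit$p.value)
```

Passing `method = "spearman"` instead would test for a monotone rather than a linear relationship, which may suit skewed price data better.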
4.2.2 Description
Most hosts write a description for their listings, and we assume that guests value what the description says. We plot a word cloud to see what hosts talk about the most.
set.seed(123)
wordcloud(words = word_counts_description$words, freq = word_counts_description$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))

We pick “private” and “kitchen” for further analysis.
We then condition on each keyword to see whether it has an effect on price. We processed each description to mine the keywords and divided the data into two groups (keyword mentioned or not) to check the effect.
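The grouping step can be sketched in base R as a case-insensitive keyword match. This is an illustration under assumptions: `descriptions` is a made-up vector, `mentions_keyword` is a hypothetical helper, and the actual keyword columns in `SFPrice_Description` may have been produced differently.

```r
# Hypothetical listing descriptions (not taken from the data set)
descriptions <- c(
  "Private room with shared kitchen near the park",
  "Sunny studio, full KITCHEN and laundry",
  "Cozy couch in a shared living room",
  NA
)

# TRUE when the keyword appears as a whole word, ignoring case;
# grepl() returns FALSE for missing descriptions
mentions_keyword <- function(text, keyword) {
  grepl(paste0("\\b", keyword, "\\b"), text, ignore.case = TRUE, perl = TRUE)
}

has_private <- mentions_keyword(descriptions, "private")
has_kitchen <- mentions_keyword(descriptions, "kitchen")

# Readable group labels for plotting, e.g. "private TRUE" / "private FALSE"
private_label <- paste("private", has_private)
print(data.frame(has_private, has_kitchen, private_label))
```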
SFPrice_Description$private <- paste("private", SFPrice_Description$private)
g1 = SFPrice_Description%>%
# Order the two groups by median price and plot the reordered factor
mutate(description = fct_reorder(as.factor(private), desc(price), .fun = median)) %>%
ggplot(aes(x = price, y = description, fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_distiller(name = "Price", palette = "GnBu") +
xlim(0,1000) +
xlab("Price") +
ylab("Description") +
ggtitle("Distribution of Each Key Word Respectively in Description") +
theme(plot.title = element_text(hjust = 0.5))
SFPrice_Description$kitchen <- paste("kitchen", SFPrice_Description$kitchen)
g2 = SFPrice_Description%>%
# Order the two groups by median price and plot the reordered factor
mutate(description = fct_reorder(as.factor(kitchen), desc(price), .fun = median)) %>%
ggplot(aes(x = price, y = description, fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_distiller(name = "Price", palette = "GnBu") +
xlim(0,1000) +
xlab("Price") +
ylab("Description")
grid.arrange(g1,g2,nrow = 2)

The ridgeline plots show the price distribution conditional on whether “private” or “kitchen” is mentioned in the listing description. We cannot see a definite difference between the two price distributions in either case.
check_d = c(colnames(SFPrice_Description)[99:100])
Data_Des <-SFPrice_Description %>% gather(key = description, value,check_d)
Data_Des$description <- Data_Des$value
ggplot(Data_Des,aes(x = description, y = price, fill = description)) +
geom_boxplot() +
ggtitle("Price v.s Description") +
scale_x_discrete(name = "Description") +
theme(plot.title = element_text(hjust = 0.5)) +
ylim(0,500) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

The boxplot checks whether the median price differs depending on whether “kitchen” or “private” is included in the description. It picks out a subtle difference between the median price of rooms that include “private” in the description and those that do not.
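Whether a subtle gap between two medians is more than noise can be checked with a rank-based two-sample test. The sketch below applies base R's `wilcox.test` to made-up prices; the data frame `toy` and its group labels are hypothetical, and the choice of the Wilcoxon rank-sum test is an assumption of this sketch rather than a method used in the report.

```r
# Hypothetical prices for listings that do / do not mention "private"
toy <- data.frame(
  price = c(150, 180, 210, 240, 175, 120, 160, 140, 200, 130),
  private = rep(c("private True", "private False"), each = 5),
  stringsAsFactors = FALSE
)

# Median price in each group (what the boxplot's center line shows)
group_medians <- tapply(toy$price, toy$private, median)

# Wilcoxon rank-sum test: do the two groups share the same price distribution?
test <- wilcox.test(price ~ private, data = toy)

print(group_medians)
print(test$p.value)
```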
4.2.3 Cleaning Fee
We assume that Airbnb customers are price sensitive, and that this sensitivity extends to the cleaning fee. We want to know whether there is a relationship between the cleaning fee and the price.
SFPrice = fread("listings.csv",header = T, sep = ',')
SFPrice$cleaning_fee = as.numeric(gsub('[$,]', '', SFPrice$cleaning_fee))
SFPrice$price = as.numeric(gsub('[$,]', '', SFPrice$price))
ggplot(SFPrice, aes(x = cleaning_fee,y = price)) +
geom_point(alpha = 0.3, color = "blue",stroke = 0) +
geom_density_2d(color = "maroon") +
ylim(0,700) +
xlim(0,300) +
xlab("Cleaning Fee") +
ylab("Price") +
ggtitle("Price v.s Cleaning Fee") +
theme_classic() +
# Center the title after theme_classic(), which would otherwise reset it
theme(plot.title = element_text(hjust = 0.5))

The figure above plots price against cleaning fee for listings in San Francisco. The result is very similar to that for room size: a weak positive correlation.
4.2.4 Location
We break the listings down by zip code to see whether location drives the price.
SFPrice <- SFPrice[!is.na(SFPrice$zipcode)]
SFPrice%>%
mutate(zipcode = fct_reorder(as.factor(zipcode),desc(price), fun = median)) %>%
ggplot(aes(x =price, y = as.factor(zipcode),fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_distiller(name = "Price", palette = "GnBu")+
xlim(0,1000) +
ylab("Zipcode") +
xlab("Price") +
ggtitle("Zipcode Distribution Comparison") +
theme(plot.title = element_text(hjust = 0.5))

The plot shows the price distribution grouped by zip code. From the figure, we can infer that zip code plays an essential role in determining the price of room rentals.
SFPrice <- SFPrice[!is.na(SFPrice$zipcode)]
SFPrice%>%
mutate(zipcode = fct_reorder(as.factor(zipcode),desc(price), fun = median)) %>%
ggplot(aes(x = zipcode, y = price, fill = zipcode)) +
geom_boxplot() +
ggtitle("Zipcode v.s Price") +
scale_x_discrete(name = "Zip Code") +
theme(plot.title = element_text(hjust = 0.5)) +
ylim(0,850) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

The boxplot further supports our guess. The price distribution varies more across zip codes than it does across room sizes or description keywords.
medians <- airbnb %>% group_by(zipcode) %>%
summarize(median = median(na.omit(price))) %>%
transmute(region = zipcode, value = median)
medians$region <- as.character(medians$region)
medians <- na.omit(medians)
medians <- subset(medians, region != 94106 & region != 94113 & region != 94965 &
region != 94510 & region != 94014)
zip_choropleth(medians, county_zoom = 6075, num_colors = 6,
title = "Median Price by Zipcode") +
scale_fill_brewer(palette = "GnBu", na.value = "white",
guide_legend(title = "Median Price"),
labels = c("89-125","125-150","150-156","156-178","178-180","180-500","No Data"))
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

We can observe that the dark, pricey areas are clustered around downtown, which clearly illustrates how location affects the median price.
4.3 How does location affect the price?
4.3.1 Famous attractions
Most tourists who come to San Francisco will visit attractions such as Fisherman’s Wharf and Lombard Street, so we assume that being closer to these places drives prices up. We found the coordinates of these places, calculated the Cartesian distance from each listing to each of them, and checked whether the distance affects pricing.
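The distance calculation described above can be sketched as follows. Everything here is illustrative: the sight coordinates are approximate values typed in by hand, the three listings are made up, and the column name `shortest_distance` is a placeholder. The distance is the plain Cartesian distance in degrees, which treats a degree of latitude and a degree of longitude as equal; that is a simplification.

```r
# Approximate coordinates of the four sights (assumed values; verify before real use)
sights <- data.frame(
  name = c("Fisherman's Wharf", "Lombard St", "Union Square", "Golden Gate Bridge"),
  latitude = c(37.8080, 37.8021, 37.7879, 37.8199),
  longitude = c(-122.4177, -122.4187, -122.4075, -122.4783),
  stringsAsFactors = FALSE
)

# Hypothetical listings with latitude/longitude columns
listings <- data.frame(
  latitude = c(37.7880, 37.8075, 37.7600),
  longitude = c(-122.4070, -122.4180, -122.4400)
)

listings$shortest_distance <- NA_real_
listings$nearest_sight <- NA_character_
for (i in seq_len(nrow(listings))) {
  # Cartesian distance (in degrees) from listing i to every sight
  d <- sqrt((sights$latitude - listings$latitude[i])^2 +
            (sights$longitude - listings$longitude[i])^2)
  listings$shortest_distance[i] <- min(d)
  listings$nearest_sight[i] <- sights$name[which.min(d)]
}

print(listings)
```

For distances in meters, a great-circle (haversine) formula would be the usual replacement for the square-root line.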
SFPrice_Sight$price = as.numeric(gsub('[$,]', '', SFPrice_Sight$price))
ggplot(SFPrice_Sight, aes(x = `Shortest Distance`, y = price)) +
geom_point(alpha = 0.3, color = "blue", stroke = 0) +
geom_density_2d(color = "maroon") +
ylim(0, 750) +
# x is the distance and y is the price, so label the axes accordingly
xlab("Shortest Distance") +
ylab("Price") +
ggtitle("Price v.s Shortest Distance to Four Famous Sights") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))

The figure above plots price against each listing’s shortest distance to four of the most famous sights in San Francisco (“Fisherman’s Wharf”, “Lombard St”, “Union Square”, “Golden Gate Bridge”). Airbnb listings in San Francisco tend to be travel oriented: most of the data points are very close to these four sights, and prices are generally lower for rooms that are far away from them.
library(ggrepel)
A1 = aggregate(SFPrice_Sight[, c(62,98)], list(SFPrice_Sight$zipcode), median)
names(A1)[1] = "Zipcode"
A1$Zipcode = as.factor(A1$Zipcode)
A1 = A1[!A1$Zipcode %in% factor("94113"),]
ggplot(A1, aes(x =`Shortest Distance`, y= price)) +
geom_point(aes(color = Zipcode)) +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis(discrete=TRUE) +
theme_bw() +
ggtitle("Median Price v.s Shortest Distance to Famous Sight Based On Zipcode") +
xlab("Shortest Distance") +
ylab("Price") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_label_repel(aes(label = Zipcode),box.padding = 0.8, point.padding = 1,segment.color = 'grey50') +
theme_classic()

The figure above shows the median price of each zip code against its shortest distance to the four famous sights in San Francisco. Zip code 94104 clearly has the highest median price and can be considered an outlier. After further investigation, we found that zip code 94104 is right next to the Financial District and Union Square. Being located at the heart of San Francisco gives room owners the privilege of listing their rooms on Airbnb at exceptionally high prices.
4.3.2 Transportation
Transportation is much more advanced today: hopping on a public bus or train can take you to most places in San Francisco. We want to know whether access to transportation affects pricing. To determine the importance of transportation for price, we target the transit description written by the host and apply text mining to extract critical information:
- Parse the transit description
- Apply methods from the nltk library to tokenize each word, remove the fillers and analyze the usefulness of each word in the paragraph
- Get the total word frequency count across transit descriptions
- Construct the analysis based on different price ranges
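The steps above were carried out with nltk; as a rough illustration of the same idea in base R, the sketch below tokenizes a few made-up transit descriptions, drops a tiny hand-written stop-word list, and counts word frequencies. The vector `transit_text`, the list `stop_words`, and the resulting values are all hypothetical; only the output shape (columns `words` and `freq`) is meant to match what the word cloud code expects.

```r
# Hypothetical transit descriptions (not taken from the data set)
transit_text <- c(
  "The bus stop is a 5 minute walk, and BART is nearby.",
  "Street parking is available; the bus runs all night.",
  NA
)

# Tiny illustrative stop-word list; a real analysis would use a fuller one
stop_words <- c("the", "is", "a", "and", "all")

# Lowercase, keep letters only, then split on spaces
clean <- tolower(transit_text[!is.na(transit_text)])
clean <- gsub("[^a-z]+", " ", clean)
tokens <- unlist(strsplit(clean, " ", fixed = TRUE))

# Drop empty strings and filler words
tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]

# Frequency table, most frequent word first
counts <- sort(table(tokens), decreasing = TRUE)
word_counts <- data.frame(
  words = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)

print(head(word_counts))
```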
set.seed(1234)
wordcloud(words = word_counts_transit$words, freq = word_counts_transit$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))

The figure above reveals the high-frequency words that San Francisco hosts use when describing how accessible transportation is from their listing. As most people would expect, these words include bus, parking, bart, train, uber, shuttle, etc.
We also want to know the relationship between transportation method and price level, so we break the listings into 5 price ranges and dig into each of them. We parse their transit descriptions and find the methods that are mentioned the most:
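Splitting listings into the five price ranges can be sketched with base R's `cut`. The labels are the ones used in the plot legend; the sample prices are made up, and assigning a boundary price such as 50 to the lower range is an assumption of this sketch.

```r
# Hypothetical nightly prices
price <- c(35, 50, 75, 120, 150, 199, 200, 450)

# Right-closed intervals: (0,50], (50,100], (100,150], (150,200], (200,Inf)
price_group <- cut(
  price,
  breaks = c(0, 50, 100, 150, 200, Inf),
  labels = c("0~50", "50~100", "100~150", "150~200", ">200")
)

# Number of listings that fall into each range
print(table(price_group))
```

The transit word counts can then be computed separately within each level of `price_group`.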
data$group = as.factor(data$group)
empty_bar=4
to_add = data.frame( matrix(NA, empty_bar*nlevels(data$group), ncol(data)) )
colnames(to_add) = colnames(data)
to_add$group=rep(levels(data$group), each=empty_bar)
data=rbind(data, to_add)
data=data %>% arrange(group)
data$id=seq(1, nrow(data))
label_data=data
number_of_bar=nrow(label_data)
angle= 90 - 360 * (label_data$id-0.5) /number_of_bar
label_data$hjust<-ifelse( angle < -90, 1, 0)
label_data$angle<-ifelse(angle < -90, angle+180, angle)
ggplot(data, aes(x=as.factor(id), y=value, fill=group)) +
geom_bar(stat="identity", alpha=0.5) +
ylim(-0.5,1.2) +
theme_minimal() +
theme(
axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
plot.margin = unit(rep(-1,4), "cm")
) +
coord_polar() +
geom_text(data=label_data, aes(x=id, y=value, label=word, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=3, angle= label_data$angle, inherit.aes = FALSE ) +
scale_fill_discrete(breaks = c("0~50","50~100","100~150","150~200",">200"), name = "Price Range")

The figure above displays the 11 most frequent words used in the transit descriptions, grouped by price range. Parking is mentioned most in the high-price group, which suggests that people paying for expensive rooms are more likely to be driving, and BART (Bay Area Rapid Transit, an inter-city train system) is mentioned the most in the low-price group, which may indicate that people paying less are more likely to stay closer to the train.
However, this information is not sufficient to prove causality.
We then go further with the text mining: we divide the listings into pairs of groups by whether or not they mention a given transit method, to see the effect of each transportation method on price.
SFPrice = fread("priceNew.csv",header = T, sep = ',')
SFPrice$bus <- paste("bus",SFPrice$bus)
G1 = ggplot(SFPrice,aes(x =price, y = as.factor(bus),fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_distiller(name = "Price", palette = "GnBu") +
theme(plot.title = element_text(hjust = 0.5)) +
xlim(0,750) + ggtitle("Price by Whether Bus Is Mentioned in Transit Description")+
xlab("Price") + ylab("Transportation")
SFPrice$bart <- paste("bart",SFPrice$bart)
G2 = ggplot(SFPrice,aes(x =price, y = as.factor(bart),fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_distiller(name = "Price", palette = "GnBu") +
theme(plot.title = element_text(hjust = 0.5)) +
xlim(0,750) + ggtitle("Price by Whether Bart Is Mentioned in Transit Description")+
xlab("Price") + ylab("Transportation")
SFPrice$shuttle <- paste("shuttle",SFPrice$shuttle)
G3 = ggplot(SFPrice,aes(x =price, y = as.factor(shuttle),fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_distiller(name = "Price", palette = "GnBu") +
theme(plot.title = element_text(hjust = 0.5)) +
xlim(0,750) + ggtitle("Price by Whether Shuttle Is Mentioned in Transit Description")+
xlab("Price") + ylab("Transportation")
SFPrice$train <- paste("train",SFPrice$train)
G4 = ggplot(SFPrice,aes(x =price, y = as.factor(train),fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_distiller(name = "Price", palette = "GnBu") +
theme(plot.title = element_text(hjust = 0.5)) +
xlim(0,750) + ggtitle("Price by Whether Train Is Mentioned in Transit Description")+
xlab("Price") + ylab("Transportation")
grid.arrange(G1,G2,G3,G4,nrow = 4)

The figure above displays the price distribution conditional on whether each of the four most prevalent means of transportation in San Francisco is mentioned. Although each pair of distributions does exhibit some difference, the differences are rather subtle, so a mention of transit cannot be taken as a meaningful factor affecting the price of room listings on Airbnb.
SFPrice = fread("priceNew.csv",header = T, sep = ',')
check = c(colnames(SFPrice)[c(99,100,101,104)])
Data <-SFPrice %>% gather(key = transportation, value,check)
Data$Transportation <- paste(Data$transportation,Data$value)
Data%>%
mutate(Transportation = fct_reorder(as.factor(Transportation), desc(price), fun = median)) %>%
ggplot(aes(x =as.factor(Transportation), y = price,fill = Transportation)) +
geom_boxplot() +
ggtitle("Price v.s Transportation") +
scale_x_discrete(name = "Transportation") +
theme(plot.title = element_text(hjust = 0.5)) +
ylim(0,500) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

The boxplot displays the price distribution conditional on whether or not each specific means of transportation is included in the transit description. Again, the boxplot confirms our previous insight: we cannot find a difference between the listings that mention each transportation method and those that do not.
SFPrice_Transit%>%
mutate(flag = fct_reorder(as.factor(Transit), desc(price), fun = median)) %>%
ggplot(aes(x =price, color = Transit)) +
geom_density(alpha = 0.5) +
ggtitle("Density Curve of Include Transit v.s not") +
theme(plot.title = element_text(hjust = 0.5)) +
xlim(0,950)

The plot shows the price density grouped by whether or not room owners include bus, bart, train or shuttle in their transit description. We cannot observe a significant difference between the listings that mention transit and those that do not.
Most people in California drive to places, and so do travelers, many of whom rent a car. In that case, a parking place would be important. How does that affect the price?
SFPrice_Parking%>%
mutate(flag = fct_reorder(as.factor(Parking), desc(price), fun = mean)) %>%
ggplot(aes(x =price, color = Parking)) +
geom_density(alpha = 0.5) +
ggtitle("Density Curve of Price v.s Parking Include or Parking Not Include") +
theme(plot.title = element_text(hjust = 0.5)) +
xlim(0,1200)+
ylab("Density") +
xlab("Price")

The plot shows the price density grouped by whether or not room owners include parking in their transit description. Again, we cannot observe a significant difference between the two groups.
5. Executive Summary
Pricing strategy is a critical component of the hotel and airline industries. Corporations like Hilton and Marriott hire researchers in fields like operations research, industrial engineering, and even psychology to develop dynamic pricing models that react to market demand and raise profit. Hosts on Airbnb do not necessarily have a degree in those fields, so how can they set a price that makes the most of the market? Browsing other listings online seems like a good idea, and Airbnb does have many filters that let people find rooms similar to theirs. However, so many components can affect the price, and there are so many listings in cities like San Francisco, that an average host does not have enough statistical knowledge to get a clear idea of what is driving the price and make the best offer. Therefore, our goal is to present these drivers and help hosts better determine what they should offer.
Location is the biggest driver of price, as we assumed. Prices are higher in the districts along the northeast coast and in the downtown area, which follows the distribution in the real estate market: a higher-valued house can charge more for room sharing. Since these houses or condos sell for a higher price, we can infer that they have great views and luxury finishes, and can therefore provide a better experience for the traveler.

The district with the highest overall price is 94104, a neighborhood right by the Financial District. One interpretation is that people who go to San Francisco on business trips want to stay close to the companies they are visiting, and they are willing to pay a higher price because their employer subsidizes the trip. For average travelers, this district is also in the heart of downtown San Francisco, which provides easier transportation and access to bars and restaurants. The word cloud below, built from the descriptions of listings in this district, supports this point.

We can see that these hosts talk a lot about clubs, Union Square, and dining in their descriptions. These are places people would rather walk to than ride half an hour to reach, so there is demand for staying close to them, which drives the price up. If we trace back the development of city neighborhoods, that is how different clusters form: people want to stay closer to the places that attract them.
On the other hand, the priciest areas are also very close to the famous attractions in San Francisco such as Fisherman’s Wharf, Golden Gate Bridge, Lombard Street, and Union Square, which also drives the price up, as we can see from the graph below. Being far from attractions prevents many districts from achieving a higher price.

It is easy to understand that people are not willing to travel far for clubs and restaurants. How about other places like attractions, parks, and shopping malls? Do travelers want to spend time on the road? If that is the case, listings with easier access to public transit would be more competitive and could charge more. However, that is not what we find: as we can observe from the graph below, the difference in price distribution between listings that mention a public transit method (Bus, Train, Bart, Shuttle) in their description and those that mention none is rather subtle.

We know that public transportation in California is not very convenient; most people drive to places, and so do travelers. If travelers rent a car, they need a place to park, so what about hosts who mention parking?

Again, there is not much difference. This is consistent with our guess that when people travel to San Francisco, they do not want to spend much time on the road, but would rather stay close to the places they plan to go.
The most exciting finding of our analysis is therefore the intrinsic relationship it reveals between price and location. Harold Samuel, a real estate tycoon in Britain, coined the expression: “There are three things that matter in property: location, location, location”. Our findings suggest that this rule still holds today, even with the unprecedented technological advancement in transportation.
Our suggestion to hosts, and to people who are looking for a house and plan to share rooms for profit in the future, is that the best selling point is being close to attractions for tourists, business districts for business travelers, or clubs and restaurants for everyone. Hosts who have this advantage should use it; otherwise, they should compete on price or service.